Overview:

You need a target criterion of model performance in order to strike a balance between over- and underfitting. Information theory provides a common and useful target, the out-of-sample deviance.

The path to understanding the relevance of out-of-sample deviance is not direct. Here are the steps ahead.
1. Understand that joint probability, not average probability, is the right way to judge model accuracy.
2. Use information theory to establish a measurement scale for distance from perfect accuracy.
3. Establish deviance as an approximation of relative distance from perfect accuracy.
4. Establish that only out-of-sample deviance is of interest.

knitr::opts_chunk$set(
  fig.align='center', dpi = 300, 
  include=FALSE, echo=FALSE, message=FALSE, warning=FALSE
)
library(knitr)
library(magrittr)
library(modelr)
library(tidyverse)

source(here::here("file_paths.R"))

file_info_entropy <- paste_image_path("info_entropy.png")
file_divergence <- paste_image_path("divergence.png")
file_deviance <- paste_image_path("deviance.png")

6.2.1 "Establish that joint probability, not average probability, is the right way to judge model accuracy."

There are 2 major dimensions to consider when defining a target for model optimization:
1. Cost-benefit analysis. How high is the cost of an error? How large are the returns for correct predictions?
2. Accuracy in context. Some prediction tasks are just easier than others.

Using rate of correct prediction as your target can be problematic when it rewards expensive failures.

6.2.1.1 Costs & Benefits

By assigning rewards and penalties to right and wrong predictions, you can optimize for more useful kinds of accuracy than the rate of correct prediction alone.
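As a toy illustration (the costs below are made up), suppose getting caught in the rain costs 5 points of happiness while carrying an umbrella costs only 1:

# Made-up costs: getting soaked is worse than lugging an umbrella around.
p_rain        <- 0.3
cost_soaked   <- 5
cost_umbrella <- 1

# expected cost of each decision rule
cost_umbrella           # always carry: pay 1 no matter what
cost_soaked * p_rain    # never carry: pay 5 whenever it rains = 1.5

# Carrying the umbrella minimizes expected cost even though
# "no rain" is the single most likely outcome.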

6.2.1.2 Measuring Accuracy

Maximizing joint probability, the probability of correctly predicting the entire sequence of observed events, is different from maximizing the average probability of being correct on any single event.

How should we describe the distance between a prediction and the observed event/target? The measure of distance should account for how hard the target is to hit. For instance, it is easier to hit a 2D target (distance and width) than a 3D target (distance, width, and timing).

As more dimensions are added to the target, the potential to miss and the potential to impress both grow.
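A minimal sketch of the joint-vs-average distinction (the observations and forecasts below are made up): ten days, rain on the first three, and two forecasters. A issues honest probabilities; B always predicts sunshine.

obs_rain <- c(rep(1, 3), rep(0, 7))  # 1 = rain, 0 = shine

pred_A <- c(rep(1, 3), rep(0.6, 7))  # A's predicted chance of rain
pred_B <- rep(0, 10)                 # B never predicts rain

# probability each forecaster assigned to what actually happened
p_obs_A <- ifelse(obs_rain == 1, pred_A, 1 - pred_A)
p_obs_B <- ifelse(obs_rain == 1, pred_B, 1 - pred_B)

mean(p_obs_A)  # 0.58
mean(p_obs_B)  # 0.70: B looks better on average probability...

prod(p_obs_A)  # ~0.0016
prod(p_obs_B)  # 0: ...but B's joint probability is zero, because
               # B gave the rainy days no chance of happening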

Keep in mind that the best model is not true. The best model provides the right probabilities given our state of ignorance. Probability is in the model not in the world. With complete information we could predict with certainty, but ignorant of many worldly factors we predict from a state of ignorance. A globe toss only seems random. We know that the globe is 70% water and so we use that to inform our model, but we are ignorant of starting point and torque and air resistance so we do not include those factors in the model. The model describes our state of ignorance.

6.2.2 Information and uncertainty

Information theory can help us measure the distance of a model's accuracy from a target. It frames the question as: how much is our uncertainty reduced by learning an outcome? Once a predicted event occurs, it is no longer uncertain; the reduction in uncertainty from learning the outcome is the information gained.

In this case, information is defined as the reduction in uncertainty from learning an outcome.

We need a function that will input the probabilities of rain and shine and output a measure of uncertainty about the weather on a future day.

A measure of uncertainty should meet three criteria:
1. It should be continuous.
2. It should increase as the number of possible events increases.
3. It should be additive.

The only function that satisfies these three criteria is Information Entropy.

Where:
- n is the number of different possible events,
- each event i has probability p~i~, and
- p is the list of probabilities.

The unique measure of uncertainty we seek is:

include_graphics(file_info_entropy)
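The formula shown in the image is the standard Shannon entropy:

$$H(p) = -\mathrm{E}\left[\log(p_i)\right] = -\sum_{i=1}^{n} p_i \log(p_i)$$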

In plainer words: The uncertainty contained in a probability distribution is the average log-probability of an event.

But what does this mean? The log-probability of an event, log(p~i~), measures how surprising that event is: rare events have very negative log-probabilities. Entropy is the negative of the average of these values, weighted by how often each event occurs, so it is large when many outcomes are plausible and small when one outcome dominates.

So H(p) is our measure of uncertainty.

For instance, in an area where p~rain~ = .30 and p~shine~ = .70, we can calculate the information entropy from the average log-probability of the possible events:

p_rain = .30  
p_shine = .70

p <- c(p_rain, p_shine)

INFO_ENTRO_rain_30 <- -sum(p*log(p)) %>% round(2)

The information entropy for this area is about r INFO_ENTRO_rain_30.

In contrast consider an area where the p~shine~ is .99:

p_rain = .01  
p_shine = .99

p <- c(p_rain, p_shine)

INFO_ENTRO_rain_01 <- -sum(p*log(p)) %>% round(2)

The information entropy for this sunny area is about r INFO_ENTRO_rain_01.

6.2.3 From entropy to accuracy

If we think about the certainty of weather in Ireland vs Dubai, we can say that uncertainty is lower in Dubai than in Ireland. Information entropy quantifies uncertainty in units of entropy.

This is useful in model comparison because we can use information entropy to say how far a model is from its target.

The target distribution has its own entropy value, and another entropy, the cross entropy, arises from using the model to predict the target. The difference between these two entropies is the divergence (the Kullback-Leibler divergence, or KL divergence).

Divergence is the additional uncertainty induced by using probabilities from one distribution to describe another distribution.

include_graphics(file_divergence)
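The formula shown in the image is the KL divergence:

$$D_{KL}(p, q) = \sum_{i} p_i \left(\log(p_i) - \log(q_i)\right) = \sum_{i} p_i \log\!\left(\frac{p_i}{q_i}\right)$$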

Divergence is equal to the average difference in log-probability between the target (p) and the model (q). As q grows more different from p, the divergence also grows.

6.2.4 From divergence to deviance

Deviance is a measure of model fit. Up to scaling, it approximates E log(q~i~), a model's average log-probability.

The absolute magnitude of each model's mean log-probability is not informative, but the difference in mean log-probability between two models is informative.

The deviance formula below approximates the relative value of E log(q~i~), the average log-probability of the candidate model q.

include_graphics(file_deviance)
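The formula shown in the image is:

$$D(q) = -2 \sum_{i} \log(q_i)$$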

i indexes each observation (case), and each q~i~ is just the likelihood of case i. The −2 is there for historical reasons.

I can't believe it. This is just the log-likelihood of the data! Deviance is −2 times the log-likelihood of the data under each model, so logLik() gives it to us almost for free:

# fit model with lm 
m_ex <- lm(mpg ~ wt, mtcars) 

# compute deviance by cheating 
(-2) * logLik(m_ex)

6.2.5 From deviance to out-of-sample

Note: "in-sample" and "out-of-sample" just refer to "training" and "test" data/statistics.

Deviance in the training data will decline with each additional parameter added to the model, while deviance in the testing data will decrease then start to increase as models become overfit from too many parameters.

This fact forms the basis for understanding regularizing priors and information criteria in the next sections.
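A minimal sketch of the pattern (an assumed example using mtcars, not from the text): fit polynomials of increasing degree on a training split, then compute deviance on both the training and test splits.

set.seed(1)
in_train <- sample(nrow(mtcars), 22)
train <- mtcars[in_train, ]
test  <- mtcars[-in_train, ]

# deviance of a fitted gaussian model on arbitrary data,
# using the training-sample MLE of sigma
deviance_on <- function(model, data) {
  mu    <- predict(model, newdata = data)
  sigma <- sqrt(mean(residuals(model)^2))
  -2 * sum(dnorm(data$mpg, mean = mu, sd = sigma, log = TRUE))
}

for (degree in 1:5) {
  f <- as.formula(paste0("mpg ~ poly(wt, ", degree, ")"))
  m <- lm(f, data = train)
  cat("degree", degree,
      "| train:", round(deviance_on(m, train), 1),
      "| test:",  round(deviance_on(m, test), 1), "\n")
}

Training deviance keeps falling as parameters are added, while test deviance eventually turns around and rises.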

Wrap-up: From Discrete Probability Distribution --> Information Entropy --> Divergence --> Deviance

A real phenomenon like rain vs. shine in 2018 has a true probability distribution. Before we know the real probabilities, we can model the probability distribution of rain vs. shine. Suppose the true discrete probability distribution of rain and shine is c(.3, .7):

p <- c(.3, .7)
ENTROPY_ACTUAL <- -sum(p*log(p)) %>% round(2)

We can say that the information entropy in the real distribution is about r ENTROPY_ACTUAL. Now suppose we use the first two months of 2018 to model the probability distribution of rain vs shine for all of 2018 and arrive at c(.25, .75). Well this modeled probability distribution has its own entropy value.

q <- c(.25, .75)
ENTROPY_MODEL <- -sum(q*log(q)) %>% round(2)

In this case the model's information entropy value is r ENTROPY_MODEL. When we use one probability distribution to describe another, we introduce additional uncertainty beyond what was already present in the real probability distribution. We call this additional uncertainty the KL divergence, and quantify it as the average difference in log-probability between the actual probability distribution and the modeled probability distribution:

(DIVERGENCE <- sum(p * log(p/q)))

The closer our model is to its target, the smaller the divergence. So we can compare the accuracy of models by contrasting the size of each model's divergence from the target, with divergence expressed in units of entropy.

The trick now is to estimate divergence in practical settings where the target distribution will be unknown.

How the hell does E log(q~i~) - E log(r~i~) tell me anything about p~i~?

I can see how H(p) subtracts out of both H(p, q) and H(p, r), but how can we know whether q or r is closer to p when p is unknown, as it will be in real statistical modeling tasks? The author's answer is that we only need E log(q~i~) and E log(r~i~), and summing the log-probabilities of each observed case provides an approximation of E log(q~i~). Because the sample data are themselves drawn from p, summing the log-probabilities of the observed outcomes adequately serves in place of knowing p.

Comparing the average log-probabilities of each model gives us a sense of their relative distance from the target, even though the absolute magnitudes of E log(q~i~) and E log(r~i~) are meaningless on their own.
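A small numeric check (with made-up distributions) of why H(p) cancels out of the comparison:

p <- c(0.3, 0.7)    # the target (unknown in real problems)
q <- c(0.25, 0.75)  # candidate model q
r <- c(0.50, 0.50)  # candidate model r

# computing each divergence requires knowing p...
(d_q <- sum(p * log(p / q)))  # ~0.006
(d_r <- sum(p * log(p / r)))  # ~0.082

# ...but their difference depends only on the E log terms
e_log_q <- sum(p * log(q))
e_log_r <- sum(p * log(r))
d_q - d_r          # equals...
e_log_r - e_log_q  # ...the difference in average log-probability

In practice we cannot compute E log(q~i~) exactly because p is unknown, but the observed data are draws from p, so the mean log-likelihood of the sample estimates it.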

We can use a model's deviance to approximate the model's divergence. Below you are summing the logs of the likelihood of each observed outcome in your data.

sppnames <- c("afarensis", "africanus", "habilis", "boisei", "rudolfensis", "ergaster", "sapiens") 
brainvolcc <- c(438, 452, 612, 521, 752, 871, 1350) 
masskg <- c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5) 

d <- data.frame(species=sppnames, brain=brainvolcc, mass=masskg)

m6.1 <- lm(brain ~ mass, d)
-2 * logLik(m6.1)

# standardize the mass before fitting
d$mass.s <- (d$mass-mean(d$mass))/sd(d$mass)

m6.8 <-
  rethinking::map(
    alist(
        brain ~ dnorm( mu , sigma ),
        mu <- a + b*mass.s
    ),
    data = d,
    start = list( a=mean(d$brain), b=0, sigma=sd(d$brain) ),
    method = "Nelder-Mead" 
  )

# extract MAP estimates
theta <- rethinking::coef(m6.8)

# compute deviance
dev <- 
  (-2) * 
  sum( 
    dnorm(
      d$brain,
      mean = theta["a"]+theta["b"]*d$mass.s,
      sd = theta["sigma"],
      log = TRUE
    ) 
  )
dev

In this case the model's deviance is about r round(dev, 1).

set.seed(57)

(ds_titanic <- as.data.frame(Titanic))

ds_titanic <- 
  ds_titanic %>% 
  mutate(
    Survived = fct_recode(Survived, "0" = "No", "1" = "Yes"),
    Survived = as.character(Survived),
    Survived = as.integer(Survived)
  ) %>% 
  uncount(Freq)

rownames(ds_titanic) <- c()

ds_titanic_100 <- 
  ds_titanic %>% 
  sample_n(100)

ds_titanic_100 %>% 
  count(Survived) %>% 
  mutate(prop = n / sum(n))

m1 <- 
  ds_titanic_100 %>% 
  glm(Survived ~ Sex, family = "binomial", data = .)

summary(m1)

m2 <- 
  ds_titanic_100 %>% 
  # a second, richer model for comparison (Class as added predictor is an assumption)
  glm(Survived ~ Sex + Class, family = "binomial", data = .)

summary(m2)

library(rethinking)
data(milk)
d <- milk

m5.10 <- 
  map( 
    alist( 
      kcal.per.g ~ dnorm(mu, sigma), 
      mu <- a + bf*perc.fat, 
      a ~ dnorm(0.6, 10), 
      bf ~ dnorm(0, 1), 
      sigma ~ dunif(0, 10) 
    ), 
  data = d)

precis(m5.10)
str(m5.10)

